A general-purpose sentence-level nonsense detector

Author

  • Ian Tenney
Abstract

I have constructed a sentence-level nonsense detector, with the goal of discriminating well-formed English sentences from the large volume of fragments, headlines, incoherent drivel, and meaningless snippets present in internet text. For many NLP tasks, the availability of large volumes of internet text is enormously helpful in combating the sparsity problem inherent in modeling language. However, the derived models can be easily polluted by ungrammatical text and spam, and automated means are necessary to filter this out and provide high-quality training data.

There is little precedent in the literature for a direct nonsense-detection system, but similar problems exist in the context of spam filtering and computational sentence completion. For spam filtering, recent research has focused on the problem of “Bayesian poisoning” (Hayes, 2007), whereby Naive Bayes-based filters are defeated by the inclusion of random words or text snippets that make the spam vocabulary appear similar to normal text. Proposed solutions combine multiple sources of information, such as higher-order n-grams, in order to introduce greater context (Upasana and Chakravarty, 2010). Sentence completion is an example of a standard NLP task where the system must consider a variety of possible solutions and discriminate between “sensical” and “nonsensical” answers. While sentence completion generally operates on a more restricted domain (e.g. SAT questions), its fundamental goal is close enough that similar techniques can be applied. The Microsoft Research Sentence Completion Challenge (Zweig and Burges, 2011) highlights several approaches to this task, including neural networks and dependency parsing, but also shows strong performance from lighter-weight n-gram and distributional models (Zweig et al., 2012).
I take a lightweight approach, using a mix of heuristics, token-based features, part-of-speech features, and language model scores to avoid computationally intensive parsing or neural networks. I structure my system as a binary classification task on a heterogeneous feature space, producing a final answer of “sentence” or “nonsense” for a given line of text. I implement this project in a mix of Java and Python, using Java to interface with Stanford CoreNLP for feature extraction, and Python (with the excellent pandas and scikit-learn libraries) for data management, classifier implementation, and analysis.
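To make the idea of lightweight token-based features concrete, here is a minimal Python sketch of the kind of per-line features such a classifier might consume. The function name `token_features`, the tokenizer regex, and the specific features are illustrative assumptions, not the feature set actually used in this project; part-of-speech and language model features are omitted because they would need external tools.

```python
import re

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def token_features(line: str) -> dict:
    """Compute a few cheap surface features for one line of text.

    These features are assumptions chosen for illustration only.
    """
    tokens = TOKEN_RE.findall(line)
    words = [t for t in tokens if re.match(r"\w", t)]
    n_tokens = len(tokens)
    n_words = len(words)
    stripped = line.strip()
    return {
        # Very short lines are often fragments or headlines.
        "n_tokens": n_tokens,
        "mean_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Title-cased or all-caps lines tend to be headlines rather than sentences.
        "frac_capitalized": (
            sum(w[0].isupper() for w in words) / n_words if n_words else 0.0
        ),
        "frac_punct": (n_tokens - n_words) / n_tokens if n_tokens else 0.0,
        "starts_upper": stripped[:1].isupper(),
        "ends_with_terminal": stripped.endswith((".", "!", "?")),
    }


if __name__ == "__main__":
    for text in ["The cat sat on the mat.", "BREAKING NEWS | Click Here"]:
        print(text, token_features(text))
```

A list of such dictionaries can be loaded into a pandas DataFrame and passed to a scikit-learn binary classifier alongside other feature groups, which is one plausible way to assemble the heterogeneous feature space described above.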




Publication date: 2014